Predictive and Distributed Routing Balancing for HPC Clusters

نویسندگان

  • Carlos Núñez Castillo
  • Diego Lugones
  • Daniel Franco
چکیده

Current parallel applications in parallel computing systems require an interconnection network to provide low and bounded communication delays. Communication characteristics such as traffic pattern and communication load change over time and, eventually, they may exceed network capacity causing congestion and performance degradation. Congestion control based on adaptive routing should be applied in order to adapt quickly to changing traffic conditions. Studies of parallel applications show repetitive behavior and that they can be characterized by a set of representative phases. This work presents a Predictive and Distributed Routing Balancing technique (PR-DRB) to control network congestion based on adaptive traffic distribution. PR-DRB uses speculative routing based on application repetitiveness. PR-DRB monitors messages latencies on routers and logs solutions to congestion, to quickly respond in future similar situations. Experimental results show that the predictive approach could be used to improve performance.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Randomized Load-balanced Routing for Fat-tree Networks

Fat-tree networks have been widely adopted to High Performance Computing (HPC) clusters and to Data Center Networks (DCN). These parallel systems usually have a large number of servers and hosts, which generate large volumes of highly-volatile traffic. Thus, distributed load-balancing routing design becomes critical to achieve high bandwidth utilization, and low-latency packet delivery. Existin...

متن کامل

Designing Anmpi-based Parallel and Distributed Machine Learning Platform on Large-scale Hpc Clusters

This paper presents the design of an MPI-based parallel and distributed machine learning platform on large-scale HPC clusters. Researchers and practitioners can implement easily a class of parallelizable machine learning algorithms on the platform, or port quickly an existing non-parallel implementation of a parallelizable algorithm to the platform with only minor modifications. Complicated fun...

متن کامل

Towards Dynamic Load Balancing Using Page Migration and Loop Re-partitioning on Omni/SCASH

Increasingly large-scale clusters of SMPs continue to become majority platform in HPC field. Such a cluster environment, there may be load imbalances due to several reasons and mis-placement of data which bring performance bottlenecks. To overcome these problems, some dynamic load balancing mechanisms are needed. In this paper, we report our ongoing work on dynamic load balancing extention to O...

متن کامل

MLCA: A Multi-Level Clustering Algorithm for Routing in Wireless Sensor Networks

Energy constraint is the biggest challenge in wireless sensor networks because the power supply of each sensor node is a battery that is not rechargeable or replaceable due to the applications of these networks. One of the successful methods for saving energy in these networks is clustering. It has caused that cluster-based routing algorithms are successful routing algorithm for these networks....

متن کامل

Dynamic Routing Balancing On InfiniBand Networks*

InfiniBand (IBA) technology was developed to address the performance issues associated with messages movement among Endnodes and computer I/O devices. However, InfiniBand is also widely deployed within high performance computing (HPC) clusters due to the high bandwidth and low message latency attributes it offers to inter-processor communication systems. An interconnection-network efficient des...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011